Examples:
We consider models in which the predicted variable is an additive combination of predictors, all of which have proportional influence on the prediction: multiple linear regression. We will also consider nonadditive combinations of predictors, called interactions.
In the context of the GLM:
See http://www.indiana.edu/kruschke/BMLR/ for more details.
A direct extension of the previous model, although now there is a coefficient for each predictor, each of which has a normal prior.
The JAGS data block standardizes all the data:
data = "
data {
  ym <- mean(y)
  ysd <- sd(y)
  for (i in 1:Ntotal) {   # for each row of data
    zy[i] <- (y[i] - ym) / ysd
  }
  for (j in 1:Nx) {       # for each predictor
    xm[j] <- mean(x[,j])
    xsd[j] <- sd(x[,j])
    for (i in 1:Ntotal) {
      zx[i,j] <- (x[i,j] - xm[j]) / xsd[j]
    }
  }
}
"
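The same standardization can be sketched in plain R (using made-up data, just to mirror what the JAGS data block above computes):

```r
# R equivalent of the JAGS data block: standardize y and each column of x.
set.seed(4)                                   # arbitrary seed for fake data
Ntotal <- 50; Nx <- 3
x <- matrix(rnorm(Ntotal * Nx, 10, 2), ncol = Nx)
y <- rnorm(Ntotal, 100, 15)
zy <- (y - mean(y)) / sd(y)
zx <- apply(x, 2, function(col) (col - mean(col)) / sd(col))
# each standardized variable now has mean 0 and sd 1
```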
model = "
model {
  for (i in 1:Ntotal) {
    zy[i] ~ dt(zbeta0 + sum(zbeta[1:Nx] * zx[i,1:Nx]), 1/zsigma^2, nu)
  }
  # Priors vague on standardized scale:
  zbeta0 ~ dnorm(0, 1/2^2)
  for (j in 1:Nx) {
    zbeta[j] ~ dnorm(0, 1/2^2)
  }
  zsigma ~ dunif(1.0E-5, 1.0E+1)
  nu <- nuMinusOne + 1
  nuMinusOne ~ dexp(1/29.0)
  # Transform back to original scale:
  beta[1:Nx] <- (zbeta[1:Nx] / xsd[1:Nx]) * ysd
  beta0 <- zbeta0*ysd + ym - sum(zbeta[1:Nx]*xm[1:Nx] / xsd[1:Nx])*ysd
  sigma <- zsigma*ysd
}
"
Use a prior on the standardized regression coefficients with standard deviation 2: standardized regression coefficients tend to fall between -1 and +1 (in simple regression they are bounded by that range), so this prior is quite flat over the credible range.
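The transform-to-original-scale lines in the model block can be sanity-checked in plain R, using a least-squares fit as a stand-in for a single MCMC draw (simulated data, not from the book):

```r
# Fit on the standardized scale, transform coefficients back, and compare
# with a fit on the original scale.
set.seed(1)
Ntotal <- 100; Nx <- 2
x <- matrix(rnorm(Ntotal * Nx, mean = 10, sd = 3), ncol = Nx)
y <- 5 + 2 * x[, 1] - 1 * x[, 2] + rnorm(Ntotal)

ym <- mean(y); ysd <- sd(y)
xm <- colMeans(x); xsd <- apply(x, 2, sd)
zy <- (y - ym) / ysd
zx <- scale(x)                        # column-wise (x - xm) / xsd

zfit   <- coef(lm(zy ~ zx))           # zbeta0, zbeta[1:Nx]
zbeta0 <- zfit[1]; zbeta <- zfit[-1]

# Same algebra as in the JAGS model block:
beta  <- (zbeta / xsd) * ysd
beta0 <- zbeta0 * ysd + ym - sum(zbeta * xm / xsd) * ysd

rawfit <- coef(lm(y ~ x))             # fit directly on the original scale
# beta0 and beta match rawfit up to numerical error
```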
\(R^2\) in traditional least-squares multiple regression is called the proportion of variance accounted for because the overall variance in \(y\) can be decomposed as \[\sum_i(y_i-\bar{y})^2 = \sum_i(y_i-\hat{y}_i)^2 + \sum_i(\hat{y}_i-\bar{y})^2,\] so \(R^2 = \sum_i(\hat{y}_i-\bar{y})^2 / \sum_i(y_i-\bar{y})^2\). Here \(\hat{y}_i\) is the linear prediction using the coefficients that minimize \(\sum_i(y_i-\hat{y}_i)^2\).
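A quick numerical check of this decomposition with `lm()` (simulated data):

```r
# Verify: total sum of squares = residual SS + regression SS, and
# R^2 = regression SS / total SS.
set.seed(2)
x <- matrix(rnorm(200), ncol = 2)
y <- 1 + x[, 1] + 0.5 * x[, 2] + rnorm(100)
fit  <- lm(y ~ x)
yhat <- fitted(fit)
tot <- sum((y - mean(y))^2)
res <- sum((y - yhat)^2)
reg <- sum((yhat - mean(y))^2)
all.equal(tot, res + reg)                      # TRUE
all.equal(reg / tot, summary(fit)$r.squared)   # TRUE
```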
In a Bayesian analysis no such decomposition of variance occurs. Instead a surrogate for \(R^2\) is computed at each step in the MCMC chain: \(R^2 = \sum_j\zeta_j r_{y,x_j}\), where \(\zeta_j\) is the standardized regression coefficient for the \(j\)th predictor at that step and \(r_{y,x_j}\) is the correlation of the data values \(y\) with the \(j\)th predictor values \(x_j\) (the correlations are constants fixed by the data). Because the \(\zeta_j\) vary across the chain, this measure can exceed 1 or fall below 0. BB: is this really a good approximation/analogy to \(R^2\)? How similar does it get if we compare to traditional linear regression on the same data?
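Partial answer to the question above: for an ordinary least-squares fit, \(\sum_j\zeta_j r_{y,x_j}\) reproduces the classical \(R^2\) exactly (a standard identity), so the surrogate agrees with tradition when the \(\zeta_j\) are at the least-squares values; it differs only because the MCMC \(\zeta_j\) wander around them. A quick check on simulated data:

```r
# For a least-squares fit, sum_j zeta_j * r(y, x_j) equals the classical R^2.
set.seed(3)
x <- matrix(rnorm(300), ncol = 3)
y <- 2 * x[, 1] - x[, 2] + rnorm(100)
zfit <- lm(as.vector(scale(y)) ~ scale(x))
zeta <- coef(zfit)[-1]          # standardized (zeta) coefficients
r    <- as.vector(cor(y, x))    # correlations r_{y,x_j}, fixed by the data
sum(zeta * r)                   # equals summary(lm(y ~ x))$r.squared
```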
The vagueness of the prior on a regression coefficient has enormous influence on its inclusion probability, even though the degree of vagueness has little influence on the estimate of the regression coefficient itself. Figure 18.14 demonstrates this effect clearly.
The reason for the lower probability of the more complex model is that each extra parameter dilutes the prior density on the pre-existing parameters (see Section 10.5). Models that include more parameters pay the cost of lower prior probability, so models with additional predictors are favored only to the extent that their benefit in higher likelihood outweighs their cost in lower prior probability. A broader prior on a regression coefficient means the prior density at any particular value tends to be smaller.
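The "broader prior means smaller density" point can be seen directly with `dnorm()`: the same coefficient value gets much less prior density under a vague prior than under a moderately informed one.

```r
# Prior density at a coefficient value of 1 under two prior widths:
dnorm(1, mean = 0, sd = 2)    # ~0.176
dnorm(1, mean = 0, sd = 10)   # ~0.0397
```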
Should spend be included or not? For the author, the explicit non-zero estimate of its regression coefficient trumps the model comparison.
Need to be very careful interpreting the results of Bayesian model comparison, because they are so strongly affected by the vagueness of the priors.
The code created for the examples in this section does not scale well to larger applications:
Variable selection:
setwd("./DBDA2Eprograms")
source("DBDA2E-utilities.R") # Load definitions of graphics functions etc.
source("Jags-Ymet-XmetMulti-MrobustVarSelect-Example.R")
# Gamma prior on sigma_beta with mode 1 and sd 10; the shape and rate come
# from gammaShRaFromModeSD(mode=1, sd=10) in DBDA2E-utilities.R.
x  <- seq(0, 20, length = 100)
hx <- dgamma(x, shape = 1.1051, rate = 0.1051)
plot(x, hx, type = "l", lty = 2, xlab = expression(sigma[beta]),
     ylab = "Density", main = "Gamma prior")
Change \(\sigma_{\beta}\) to a fixed value of \(10.0\) instead of estimating it with a gamma prior.
The model probabilities are much different now, and the favored model is a different one, showing how sensitive variable selection is to the width of the priors on the regression coefficients. A choice of \(\sigma_{\beta} = 2\) would have produced much more similar results. Either way, \(\sigma_{\beta}\) is fixed at a single value, so the priors for all the regression coefficients share that same standard deviation; a larger value allows greater credibility for coefficients further from 0.
Set \(\sigma_{\beta}\) back to the gamma prior, and now change the prior on the inclusion indicators from delta[j] ~ dbern(0.5) to delta[j] ~ dbern(0.2). This means that a priori we expect only 20% of the predictors to be included, as opposed to 50%.
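With independent Bernoulli inclusion indicators, the implied prior on the number of included predictors is binomial. A small illustration, assuming (hypothetically) Nx = 4 candidate predictors:

```r
# Prior probability that k of the Nx predictors are included, for each prior
# on delta[j]. Nx = 4 is just an illustrative choice.
Nx <- 4
prior0.5 <- dbinom(0:Nx, Nx, 0.5)   # delta[j] ~ dbern(0.5)
prior0.2 <- dbinom(0:Nx, Nx, 0.2)   # delta[j] ~ dbern(0.2)
rbind(prior0.5, prior0.2)
# dbern(0.5) puts most mass on 2 predictors; dbern(0.2) piles it onto 0 or 1.
```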
This prior definitely tends to favor models with smaller numbers of predictors.